This EDA project on Hotel Booking Analysis investigates cancellations, and their underlying patterns; and suggests measures that can be implemented to reduce cancellations and secure revenue1:
The project covers booking information for a city hotel and a resort hotel including information such as when the booking was made, length of stay, the number of adults, children. The project went through the basic idea of the EDA and visualization process.
In this project I will do Exploratory Data Analysis on the given dataset. The project suggests measures that can be implemented to reduce cancellations and secure revenue. For example, hotels can offer discounts or promotions to customers who book early or who book for longer stays. Hotels can also offer incentives such as free parking or free breakfast to customers who book directly with them instead of through third-party websites.
This EDA involves following steps where in first step involves exploration and inspection over raw data, and second in second step I have dealt with data impurities and cleaned the data by andling null values and dropping irrelevent data from the dataset.
The project concludes that by analyzing hotel bookings data and understanding cancellations patterns, hotels can take steps to reduce cancellations and increase revenue.
Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data..
The project aims to gain interesting insight into customers’ behavior when booking a hotel. The demand for different segment of customer may differ and forecasting become harder as it may requires different model for different segment.These insights can guide hotels to adjust their customer strategies and make preparation for unknown.
# Importing necessary libraries needed in EDA
import numpy as np
import pandas as pd
# for visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# will be used for plotting
import plotly.express as px
#Loading the dataset
hb_df = pd.read_csv('Hotel Bookings.csv')
hb_df.shape
(119390, 32)
# Dataset First Look
hb_df
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | No Deposit | 394.0 | NaN | 0 | Transient | 96.14 | 0 | 0 | Check-Out | 2017-09-06 |
| 119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 225.43 | 0 | 2 | Check-Out | 2017-09-07 |
| 119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 157.71 | 0 | 4 | Check-Out | 2017-09-07 |
| 119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 89.0 | NaN | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 2017-09-07 |
| 119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 151.20 | 0 | 2 | Check-Out | 2017-09-07 |
119390 rows × 32 columns
#Looking first 5 rows of the datset
hb_df.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
#Looking the last 5 rows of the dataset
hb_df.tail()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | No Deposit | 394.0 | NaN | 0 | Transient | 96.14 | 0 | 0 | Check-Out | 2017-09-06 |
| 119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 225.43 | 0 | 2 | Check-Out | 2017-09-07 |
| 119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 157.71 | 0 | 4 | Check-Out | 2017-09-07 |
| 119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | No Deposit | 89.0 | NaN | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 2017-09-07 |
| 119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | ... | No Deposit | 9.0 | NaN | 0 | Transient | 151.20 | 0 | 2 | Check-Out | 2017-09-07 |
5 rows × 32 columns
print(f'Number of rows : {len(hb_df.axes[0])}')
print(f'Number of rows : {len(hb_df.axes[1])}')
Number of rows : 119390 Number of rows : 32
hb_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 119390 entries, 0 to 119389 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 119390 non-null object 1 is_canceled 119390 non-null int64 2 lead_time 119390 non-null int64 3 arrival_date_year 119390 non-null int64 4 arrival_date_month 119390 non-null object 5 arrival_date_week_number 119390 non-null int64 6 arrival_date_day_of_month 119390 non-null int64 7 stays_in_weekend_nights 119390 non-null int64 8 stays_in_week_nights 119390 non-null int64 9 adults 119390 non-null int64 10 children 119386 non-null float64 11 babies 119390 non-null int64 12 meal 119390 non-null object 13 country 118902 non-null object 14 market_segment 119390 non-null object 15 distribution_channel 119390 non-null object 16 is_repeated_guest 119390 non-null int64 17 previous_cancellations 119390 non-null int64 18 previous_bookings_not_canceled 119390 non-null int64 19 reserved_room_type 119390 non-null object 20 assigned_room_type 119390 non-null object 21 booking_changes 119390 non-null int64 22 deposit_type 119390 non-null object 23 agent 103050 non-null float64 24 company 6797 non-null float64 25 days_in_waiting_list 119390 non-null int64 26 customer_type 119390 non-null object 27 adr 119390 non-null float64 28 required_car_parking_spaces 119390 non-null int64 29 total_of_special_requests 119390 non-null int64 30 reservation_status 119390 non-null object 31 reservation_status_date 119390 non-null object dtypes: float64(4), int64(16), object(12) memory usage: 29.1+ MB
# Dataset Duplicate Value Count
hb_df.duplicated().sum()
31994
There are 31994 duplicate values in the dataset
#Dropping the duplicate values
hb_df.drop_duplicates(inplace = True)
hb_df.shape
(87396, 32)
# Missing Values/Null Values Count
hb_df.isnull().sum()
hotel 0 is_canceled 0 lead_time 0 arrival_date_year 0 arrival_date_month 0 arrival_date_week_number 0 arrival_date_day_of_month 0 stays_in_weekend_nights 0 stays_in_week_nights 0 adults 0 children 4 babies 0 meal 0 country 452 market_segment 0 distribution_channel 0 is_repeated_guest 0 previous_cancellations 0 previous_bookings_not_canceled 0 reserved_room_type 0 assigned_room_type 0 booking_changes 0 deposit_type 0 agent 12193 company 82137 days_in_waiting_list 0 customer_type 0 adr 0 required_car_parking_spaces 0 total_of_special_requests 0 reservation_status 0 reservation_status_date 0 dtype: int64
# Visualizing the missing values using Seaborn heatmap
plt.figure(figsize=(25,12))
sns.heatmap(hb_df.isna().transpose(),
cmap="YlGnBu",
cbar_kws={'label': 'Missing Data'})
plt.title('Missing Values', fontsize=25)
plt.show()
We can see that there are total four columns with missing/null values : company, agent, country, children.
# Dataset Columns
hb_df.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
'arrival_date_month', 'arrival_date_week_number',
'arrival_date_day_of_month', 'stays_in_weekend_nights',
'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
'country', 'market_segment', 'distribution_channel',
'is_repeated_guest', 'previous_cancellations',
'previous_bookings_not_canceled', 'reserved_room_type',
'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
'company', 'days_in_waiting_list', 'customer_type', 'adr',
'required_car_parking_spaces', 'total_of_special_requests',
'reservation_status', 'reservation_status_date'],
dtype='object')
# Dataset Describe
hb_df.describe()
| is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | agent | company | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87392.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 75203.000000 | 5259.000000 | 87396.000000 | 87396.000000 | 87396.000000 | 87396.000000 |
| mean | 0.274898 | 79.891368 | 2016.210296 | 26.838334 | 15.815541 | 1.005263 | 2.625395 | 1.875795 | 0.138640 | 0.010824 | 0.039075 | 0.030413 | 0.183990 | 0.271603 | 94.138306 | 183.081384 | 0.749565 | 106.337246 | 0.084226 | 0.698567 |
| std | 0.446466 | 86.052325 | 0.686102 | 13.674572 | 8.835146 | 1.031921 | 2.053584 | 0.626500 | 0.455881 | 0.113597 | 0.193775 | 0.369145 | 1.731894 | 0.727245 | 113.188172 | 130.557608 | 10.015731 | 55.013953 | 0.281533 | 0.831946 |
| min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | -6.380000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 11.000000 | 2016.000000 | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 47.000000 | 0.000000 | 72.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 49.000000 | 2016.000000 | 27.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.000000 | 169.000000 | 0.000000 | 98.100000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 125.000000 | 2017.000000 | 37.000000 | 23.000000 | 2.000000 | 4.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 240.000000 | 263.000000 | 0.000000 | 134.000000 | 0.000000 | 1.000000 |
| max | 1.000000 | 737.000000 | 2017.000000 | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | 10.000000 | 10.000000 | 1.000000 | 26.000000 | 72.000000 | 21.000000 | 535.000000 | 543.000000 | 391.000000 | 5400.000000 | 8.000000 | 5.000000 |
Hotel : (Resort Hotel or City Hotel)
is_canceled: Value indicating if the booking was canceled (1) or not (0)
lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year : Year of arrival date
arrival_date_month : Month of arrival date
arrival_date_week_number : Week number of year for arrival date
arrival_date_day_of_month : Day of arrival date
stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults : Number of adults
children : Number of children
babies : Number of babies
meal : Type of meal booked. Categories are presented in standard hospitality meal packages
country : Country of origin.` market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons.
assigned_room_type : Code for the type of room assigned to the booking.
booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type : Indication on if the customer made a deposit to guarantee the booking.
agent : ID of the travel agency that made the booking
company : ID of the company/entity that made the booking or responsible for paying the booking.
days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type : Type of booking, assuming one of four categories
adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces : Number of car parking spaces required by the customer
total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status : Reservation last status, assuming one of three categories
Canceled – booking was canceled by the customer Check-Out – customer has checked in but already departed No-Show – customer did not check-in and did inform the hotel of the reason why reservation_status_date - Date at which the last status was set
# Check Unique Values for each variable.
pd.Series({col:hb_df[col].unique() for col in hb_df})
hotel [Resort Hotel, City Hotel] is_canceled [0, 1] lead_time [342, 737, 7, 13, 14, 0, 9, 85, 75, 23, 35, 68... arrival_date_year [2015, 2016, 2017] arrival_date_month [July, August, September, October, November, D... arrival_date_week_number [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 3... arrival_date_day_of_month [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... stays_in_weekend_nights [0, 1, 2, 4, 3, 6, 13, 8, 5, 7, 12, 9, 16, 18,... stays_in_week_nights [0, 1, 2, 3, 4, 5, 10, 11, 8, 6, 7, 15, 9, 12,... adults [2, 1, 3, 4, 40, 26, 50, 27, 55, 0, 20, 6, 5, 10] children [0.0, 1.0, 2.0, 10.0, 3.0, nan] babies [0, 1, 2, 10, 9] meal [BB, FB, HB, SC, Undefined] country [PRT, GBR, USA, ESP, IRL, FRA, nan, ROU, NOR, ... market_segment [Direct, Corporate, Online TA, Offline TA/TO, ... distribution_channel [Direct, Corporate, TA/TO, Undefined, GDS] is_repeated_guest [0, 1] previous_cancellations [0, 1, 2, 3, 26, 25, 14, 4, 24, 19, 5, 21, 6, ... previous_bookings_not_canceled [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... reserved_room_type [C, A, D, E, G, F, H, L, P, B] assigned_room_type [C, A, D, E, G, F, I, B, H, P, L, K] booking_changes [3, 4, 0, 1, 2, 5, 17, 6, 8, 7, 10, 16, 9, 13,... deposit_type [No Deposit, Refundable, Non Refund] agent [nan, 304.0, 240.0, 303.0, 15.0, 241.0, 8.0, 2... company [nan, 110.0, 113.0, 270.0, 178.0, 240.0, 154.0... days_in_waiting_list [0, 50, 47, 65, 122, 75, 101, 150, 125, 14, 60... customer_type [Transient, Contract, Transient-Party, Group] adr [0.0, 75.0, 98.0, 107.0, 103.0, 82.0, 105.5, 1... required_car_parking_spaces [0, 1, 2, 8, 3] total_of_special_requests [0, 1, 3, 2, 4, 5] reservation_status [Check-Out, Canceled, No-Show] reservation_status_date [2015-07-01, 2015-07-02, 2015-07-03, 2015-05-0... dtype: object
# creating a duplicate of the original dataset before making any changes in it
hb_df1 = hb_df.copy()
hb_df1.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
'arrival_date_month', 'arrival_date_week_number',
'arrival_date_day_of_month', 'stays_in_weekend_nights',
'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
'country', 'market_segment', 'distribution_channel',
'is_repeated_guest', 'previous_cancellations',
'previous_bookings_not_canceled', 'reserved_room_type',
'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
'company', 'days_in_waiting_list', 'customer_type', 'adr',
'required_car_parking_spaces', 'total_of_special_requests',
'reservation_status', 'reservation_status_date'],
dtype='object')
# replacing null values in children column with 0 assuming that family had 0 children
# replacing null values in company and agent columns with 0 assuming those rooms were booked without company/agent
hb_df1['children' ].fillna(0, inplace = True)
hb_df1['company' ].fillna(0, inplace = True)
hb_df1['agent' ].fillna(0, inplace = True)
# replacing null values in country column as 'Others'
hb_df1['country'].fillna('Others', inplace = True)
# checking for null values after replacing them
hb_df1.isnull().sum()
hotel 0 is_canceled 0 lead_time 0 arrival_date_year 0 arrival_date_month 0 arrival_date_week_number 0 arrival_date_day_of_month 0 stays_in_weekend_nights 0 stays_in_week_nights 0 adults 0 children 0 babies 0 meal 0 country 0 market_segment 0 distribution_channel 0 is_repeated_guest 0 previous_cancellations 0 previous_bookings_not_canceled 0 reserved_room_type 0 assigned_room_type 0 booking_changes 0 deposit_type 0 agent 0 company 0 days_in_waiting_list 0 customer_type 0 adr 0 required_car_parking_spaces 0 total_of_special_requests 0 reservation_status 0 reservation_status_date 0 dtype: int64
# dropping the 'company' column as it contains a lot of null values in coparison to other columns
hb_df1.drop(['company'], axis =1 , inplace = True) # dropping the values vertically at axis 1 (columns)
# dropping rows where no adults , children and babies are available because no bookings were made that day
no_guest=hb_df1[hb_df1['adults']+hb_df1['babies']+hb_df1['children']==0]
hb_df1.drop(no_guest.index, inplace=True)
# adding some new columns to make our data analysis ready
hb_df1['total_people'] = hb_df1['adults'] + hb_df1['babies'] + hb_df1['children'] # creating total people column by adding all the people in that booking
hb_df1['total_stay'] = hb_df1['stays_in_weekend_nights'] + hb_df1['stays_in_week_nights'] # creating a column to check total stay by prople in that booking
# having a final look to check if our dataset is ready to analyse
hb_df1.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | agent | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | total_people | total_stay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | 0.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 2.0 | 0 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | 0.0 | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 2.0 | 0 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | 0.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 1.0 | 1 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | 304.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 1.0 | 1 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | 240.0 | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 | 2.0 | 2 |
5 rows × 33 columns
hb_df1.tail()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | agent | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | total_people | total_stay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | 394.0 | 0 | Transient | 96.14 | 0 | 0 | Check-Out | 2017-09-06 | 2.0 | 7 |
| 119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | 9.0 | 0 | Transient | 225.43 | 0 | 2 | Check-Out | 2017-09-07 | 3.0 | 7 |
| 119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | 9.0 | 0 | Transient | 157.71 | 0 | 4 | Check-Out | 2017-09-07 | 2.0 | 7 |
| 119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | 89.0 | 0 | Transient | 104.40 | 0 | 0 | Check-Out | 2017-09-07 | 2.0 | 7 |
| 119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | ... | 9.0 | 0 | Transient | 151.20 | 0 | 2 | Check-Out | 2017-09-07 | 2.0 | 9 |
5 rows × 33 columns
# checking the final shape of the dataset
print(f' final shape of the dataset is {hb_df1.shape}')
final shape of the dataset is (87230, 33)
# checking the unique values which is to be analysed
pd.Series({col:hb_df1[col].unique() for col in hb_df1})
hotel [Resort Hotel, City Hotel] is_canceled [0, 1] lead_time [342, 737, 7, 13, 14, 0, 9, 85, 75, 23, 35, 68... arrival_date_year [2015, 2016, 2017] arrival_date_month [July, August, September, October, November, D... arrival_date_week_number [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 3... arrival_date_day_of_month [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... stays_in_weekend_nights [0, 1, 2, 4, 3, 6, 13, 8, 5, 7, 12, 9, 16, 18,... stays_in_week_nights [0, 1, 2, 3, 4, 5, 10, 11, 8, 6, 7, 15, 9, 12,... adults [2, 1, 3, 4, 40, 26, 50, 27, 55, 20, 6, 5, 10, 0] children [0.0, 1.0, 2.0, 10.0, 3.0] babies [0, 1, 2, 10, 9] meal [BB, FB, HB, SC, Undefined] country [PRT, GBR, USA, ESP, IRL, FRA, Others, ROU, NO... market_segment [Direct, Corporate, Online TA, Offline TA/TO, ... distribution_channel [Direct, Corporate, TA/TO, Undefined, GDS] is_repeated_guest [0, 1] previous_cancellations [0, 1, 2, 3, 26, 25, 14, 4, 24, 19, 5, 21, 6, ... previous_bookings_not_canceled [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... reserved_room_type [C, A, D, E, G, F, H, L, B] assigned_room_type [C, A, D, E, G, F, I, B, H, L, K] booking_changes [3, 4, 0, 1, 2, 5, 17, 6, 8, 7, 10, 16, 9, 13,... deposit_type [No Deposit, Refundable, Non Refund] agent [0.0, 304.0, 240.0, 303.0, 15.0, 241.0, 8.0, 2... days_in_waiting_list [0, 50, 47, 65, 122, 75, 101, 150, 125, 14, 60... customer_type [Transient, Contract, Transient-Party, Group] adr [0.0, 75.0, 98.0, 107.0, 103.0, 82.0, 105.5, 1... required_car_parking_spaces [0, 1, 2, 8, 3] total_of_special_requests [0, 1, 3, 2, 4, 5] reservation_status [Check-Out, Canceled, No-Show] reservation_status_date [2015-07-01, 2015-07-02, 2015-07-03, 2015-05-0... total_people [2.0, 1.0, 3.0, 4.0, 5.0, 12.0, 40.0, 26.0, 50... total_stay [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 10, 11, 8, 9,... dtype: object
We can see that we have dealt with all the null values and added some new columns and now our dataset is ready to analysed.
Created a copy of the dataset before doing any manipulation then filled missing values with 0 in children , company and agent columns as those columns had numerical values and in column country filled missing values with 'others'. after dealing with missing values I dropped the country column as this had 96% missing values and was of no use in our analysis. In next step I created 2 new columns named 'total_people' and 'total_stay' for further analysis. In total people column I added all the babies, children and adults. similarly in second new column I added weekend stay and week stay column.
After doing all the manipulation I checked new manipulated dataset to check if this is ready to be analyzed.
After manipulating the dataset these were the insights I found:
1. There are 2 types of hotel which guests could book so I can find which type of hotel was booked most.
2. There are different types of guests and they come from different countries.
3. Guests can choose different foods from the menu.
4. Guests can book hotel directly or through different channels that are available.
5. Guests can cancel their booking and there are repeated guests also.
6. Guests can choose rooms of their liking while booking.
7. There is column available in the dataset named 'adr' which could be used to analyze hotel's performance on the basis of revenue.
# Chart - 1 visualization code
# Storing unique hotel names in a variable
hotel_name = hb_df1['hotel'].unique()
# Checking the number of unique booking in each hotel type
unique_booking = hb_df1.hotel.value_counts().sort_values(ascending=True)
# Creating a donut chart using plotly.express
fig1 = px.pie(names = hotel_name, values = unique_booking, hole = 0.5, color = hotel_name,
color_discrete_map={
'Resort Hotel': 'teal' , 'City Hotel' : 'nude'})
# Giving it a title and updating the text info
fig1.update_traces(textinfo = 'percent + value')
fig1.update_layout(title_text = 'Hotel Booking Percentage', title_x = 0.5)
# Setting the legend at center
fig1.update_layout(legend=dict(
orientation = 'h',
yanchor = 'bottom',
xanchor = 'center',
x = 0.5
))
# Display the figure
fig1.show()
# Count Hotel
hotel_count = hb_df1.hotel.value_counts()
# Plotting Values in a simple pie chart
hotel_count.plot.pie(figsize=(9,7), autopct='%1.2f%%', shadow=True, fontsize=15,startangle=50)
# Setting the title
plt.title('Hotel Booking Percentage')
plt.axis('equal')
plt.show()
I used Donut chart here because it is used to show the proportions of categorical data, with the size of each piece representing the proportion of each category.
I found out that guests prefer Resort Hotel most over City Hotel.
Are there any insights that lead to negative growth? Justify with specific reason.
This insight is useful for the stakeholder to check which hotel is performing best and they can invest more capitals in that. There is no such negative growth but stakeholders can focus more on City Hotel to get more booking and icrease the overall revenue.
# Chart - 2 visualization code
# Extracting and storing unique values of hotel cancelation
cancelled_hotel = hb_df1.is_canceled.value_counts()
# Craeting a pie chart
cancelled_hotel.plot.pie(figsize=(9,7), explode=(0.05,0.05), autopct='%1.2f%%', shadow=True, fontsize=15,startangle=50)
# Giving our pie chart a title
plt.title('Percentage of Hotel Cancellation and Non Cancellation')
plt.axis('equal')
plt.show()
I had to show a part-to-a-whole relationship and percentage of both the values and here pie chart was a good option to show segmented values.
Here we can see that around 72.48% bookins are not canceled by guests but around 27.52% bookings are canceled by guests.
Are there any insights that lead to negative growth? Justify with specific reason.
This insight will help stakeholders in comparing the cancellation and non cancellation of bookings. With the help of this insight stakeholders can offer rescheduling the bookings instead of cancellation and set a flexible cancellation policy to reduce booking cancellation.
# Chart - 3 visualization code
# Counting each meal type
meal_count = hb_df1.meal.value_counts()
# Extracting each meal type and storing in a variable
meal_name = hb_df1['meal'].unique()
# Creating a dataset of each meal type and count
meal_df = pd.DataFrame(zip(meal_name,meal_count), columns = ['meal name', 'meal count'])
# Visualising the values on a bar chart
plt.figure(figsize=(15,5))
g = sns.barplot(data=meal_df, x='meal name', y ='meal count')
g.set_xticklabels(meal_df['meal name'])
plt.title('Most preffered meal type', fontsize=25)
plt.show()
Meal type variable description:
BB - (Bed and Breakfast)
HB- (Half Board)
FB- (Full Board)
SC- (Self Catering)
There were 4 values to compare and Bar graphs are used to compare things between different groups that is why I used this chart.
After visualizing the above chart we can see that BB - (Bed and Breakfast) is the most preffered meal type by guests.
Are there any insights that lead to negative growth? Justify with specific reason.
Yes, from the gained insight above now stakeholders know that BB(Bed and Breakfast) is most preferred meal type so they can arrange raw material for this meal in advance and deliver the meal without any delay.
# Chart - 4 visualization code
# Plotting with countplot
plt.figure(figsize=(10,4))
sns.countplot(x=hb_df1['arrival_date_year'],hue=hb_df1['hotel'])
plt.title("Number of bookings across year", fontsize = 25)
plt.show()
Bar graphs are used to compare things between different groups that is why I used this chart.
From above insight I found out that hotel was booked most times in year 2016.
Are there any insights that lead to negative growth? Justify with specific reason.
Above insight shows that number of booking was declined after year 2016. Stakeholders can now what went wrong after 2016 and fix that problem to increase the umber of bookings. One way to do this is ask for feedbacks from guests and have a meeting with old employees who else were serving int the year 2016.
# Chart - 5 visualization code
plt.figure(figsize=(15,5))
sns.countplot(x=hb_df1['arrival_date_month'],hue=hb_df1['hotel'])
plt.title("Number of booking across months", fontsize = 25)
plt.show()
I had to compare values across the months and for that bar chart was one of the best choice.
Above insight shows that August and July ware 2 most busy months in compare to others.
Are there any insights that lead to negative growth? Justify with specific reason.
There is negative insight but hotel can use this insight to arrange everything in advance and welcome their guest in the best way possible and hotel can also run some promotional offer in these 2 months to attract more guests.
# Chart - 6 visualization code
# Coounting number of guests from various countries and changing column names
country_df = hb_df1['country'].value_counts().reset_index().rename(columns={'index': 'country','country': 'guests count'})[:10]
# Visualising the values on a bar chart
# setting the graph size
plt.figure(figsize=(15,4))
sns.barplot(x=country_df['country'], y=country_df['guests count'])
plt.title('Number of guests from each country', fontsize=20)
plt.show()
Here I comapred different values that's why I used bar chart.
From the above chart I found out that most guests come from PRT(Portugal).
Are there any insights that lead to negative growth? Justify with specific reason.
There is no negative insight. After knowing that most of the guests come from Portugal Hotels can add more Portugal cousines in their menu to make guests order more food.
# Visualization code
# Creating a datset of distribution channel name and count
dist_df = hb_df1['distribution_channel'].value_counts().reset_index()
# Renaming the columns to approproiate names
dist_df = dist_df.rename(columns={'index':'Channel name', 'distribution_channel':'channel count'})
# Creating a explode data
my_explode = (0.05,0.05,0.05,0.05,0.05)
#adding percentage columns to the distribution_channel_df
dist_df['percentage']=round(dist_df['channel count']*100/hb_df1.shape[0],1)
# Deciding the figure size
plt.figure(figsize=(15,6))
# Plotting the chart values
plt.pie(dist_df['channel count'], labels=None, explode = my_explode, startangle = 50)
# Adding legends with percenatge using list comprehension
labels = [f'{l}, {s}%' for l, s in zip(dist_df['Channel name'].value_counts().index.tolist(), dist_df['percentage'].values.tolist())]
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)
# Setting the chart title
plt.title('Most Used Booking Distribution Channels by Guests' ,fontsize = 16)
# Show the chart
plt.axis('equal')
plt.show()
Pie chart is one of the best chart to visualize categoriacal data.
From the above insight it is clear that TA/TO (travel agents/Tour operators) is most used distribution channel by guests.
Are there any insights that lead to negative growth? Justify with specific reason.
There is no negative insight. Hotels can run promotional offers to motivate other channels to contribute more in bookings.
# Chart - 8 visualization code
# Setting the figure size
plt.figure(figsize=(15,5))
# Plotting the values in chart
sns.countplot(x=hb_df1['reserved_room_type'],order=hb_df1['reserved_room_type'].value_counts().index)
# Setting the title
plt.title('Preffered Room Type by Guests', fontsize = 20)
# Show the chart
plt.show()
A bar plot shows catergorical data as rectangular bars with the height of bars proportional to the value they represent. It is often used to compare between values of different categories in the data.
By observing the above chart we can understand that the room type A most preffered ( almost 55,000) by the guests while booking the hotel.
Are there any insights that lead to negative growth? Justify with specific reason.
As it is clear that room type A is most used hotel should increase the number of A type room to maximize the revenue.
# Chart - 9 visualization code
# Setting the figure size
plt.figure(figsize=(15,5))
# Plotting the values
sns.countplot(x=hb_df1['assigned_room_type'], order = hb_df1['assigned_room_type'].value_counts().index)
# Setting the title
plt.title('Assigned Room Type to Guests', fontsize = 20)
# show the chart
plt.show()
A bar plot shows catergorical data as rectangular bars with the height of bars proportional to the value they represent.
From the above chart it is clear that room type A is most assigned to guests.
Are there any insights that lead to negative growth? Justify with specific reason.
In the 8th chart we saw that around 55,000 guests preffered room type A but 45,000 people were assigned A type room. This could be a reason to cancel the bookings. Hotel could increase A type room to decrease cancellation.
# Chart - 10 visualization code
# Creating a dataset by grouping by agent column and it's count
agents = hb_df1.groupby(['agent'])['agent'].agg({'count'}).reset_index().rename(columns={'count':'Booking Count'}
).sort_values(by = 'Booking Count', ascending = False)
# Extracting top 5 agents by booking count
top_5 = agents[:5]
# Explosion
explode = (0.02,0.02,0.02,0.02,0.02)
# Colors
colors = ( "orange", "cyan", "brown", "indigo", "beige")
# Wedge properties
wp = { 'linewidth' : 1, 'edgecolor' : "green" }
# Creating autocpt arguments
def func(pct, allvalues):
absolute = int(pct / 100.*np.sum(allvalues))
return "{:.1f}%\n({:d} g)".format(pct, absolute)
# Plotting the values
fig, ax = plt.subplots(figsize =(15, 7))
wedges, texts, autotexts = ax.pie(top_5['Booking Count'],
autopct = lambda pct: func(pct, top_5['Booking Count']),
explode = explode,
shadow = False,
colors = colors,
startangle = 50,
wedgeprops = wp)
# Adding legend
ax.legend(wedges, top_5['agent'],
title ="agents",
loc ="upper left",
bbox_to_anchor =(1, 0, 0.5, 1))
plt.setp(autotexts, size = 8, weight ="bold")
ax.set_title("Top 5 agents in terms of booking", fontsize = 17)
# Show chart
plt.axis('equal')
plt.show()
A pie chart helps organize and show data as a percentage of a whole
We can see that agent number 9 has made the most number of bookings followed by agent number 240, 0, 14 and 7.
Are there any insights that lead to negative growth? Justify with specific reason.
Hotel can offer them bonus for their incredible work and to motivate them. This will help to increase the revenue.
# Chart - 11 visualization code
# Creating a variable containing guests with their repeated counts
rep_guests = hb_df1['is_repeated_guest'].value_counts()
# Plotting the values in a pie chart
rep_guests.plot.pie(autopct='%1.2f%%', explode=(0.00,0.09), figsize=(15,6), shadow=False)
# Setting the title
plt.title('Percentage of Repeated Guests', fontsize=20)
# Setting the chart in centre
plt.axis('equal')
# Show the chart
plt.show()
A pie chart helps organize and show data as a percentage of a whole
From the above insight we can see that 3.86% guests are repeated guests.
Are there any insights that lead to negative growth? Justify with specific reason.
We can see that number of repeated guests is very low and it shows negative growth of the hotel. Hotel can offer loyality discount to their guests to increase repeated guests.
# Chart - 12 visualization code
cust_type = hb_df1['customer_type'].value_counts()
# Plotting the values in a line chart
cust_type.plot(figsize=(15,5))
# Setting the x label , y label and title
plt.xlabel('Count', fontsize=8)
plt.ylabel('Customer Type', fontsize=10)
plt.title('Customer Type and their booking count', fontsize=20)
# Show the chart
plt.show()
Line graphs are used to track changes over different categories.
We can see that Transient customer type has most number of bookings.
Are there any insights that lead to negative growth? Justify with specific reason.
Hotel can run promotional offers to increase the number of bookings over other categories. such as hotel could offer discounts for groups.
# Chart - 13 visualization code
plt.figure(figsize=(15,5))
sns.countplot(x=hb_df1['market_segment'], order = hb_df1['market_segment'].value_counts().index)
plt.title('Market segment sahre in booking', fontsize=20)
plt.show()
A bar plot shows catergorical data as rectangular bars with the height of bars proportional to the value they represent.
Above insight shows that Online TA (Travel Agent) has the most bookings.
Are there any insights that lead to negative growth? Justify with specific reason.
There is no negative growth. Hotel should come up with some great idea to increase sahre among other market segments to increase the revenue.
Chart -14
# Visualization Code
# Counting each deposte type
deposite = hb_df1['deposit_type'].value_counts().index
# Setting the chart size
plt.figure(figsize=(8,4))
# plotting the values
sns.countplot(x=hb_df1['deposit_type'], order= deposite)
plt.title('Most used deposite type')
plt.show()
Chart - 15
# Chart - 11 visualization code
# Creating a not cancelled dataframe
not_cancelled_df = hb_df1[hb_df1['is_canceled'] == 0]
# Creating a hotel stay dataframe
hotel_stay = not_cancelled_df[not_cancelled_df['total_stay'] <= 15] #Visualizing pattern till 15days stay
# Setting plot size and plotting barchart
plt.figure(figsize = (15,5))
sns.countplot(x = hotel_stay['total_stay'], hue = hotel_stay['hotel'])
# Adding the label of the chart
plt.title('Total number of stays in each hotel',fontsize = 20)
plt.xlabel('Total stay')
plt.ylabel("Count of days")
plt.show()
From the above chart we can see that in City hotel most people stay for 3 days and in Resort hotel most people stay for only 1 day.
Hotel should work on to increase total stay in Resort hotel to increase revenue.
Chart-16
# Counting the revnue for each hotel type using groupby function
most_rev = hb_df1.groupby('hotel')['adr'].count()
# Plotting the values in a pie chart
most_rev.plot.pie(autopct='%1.2f%%', figsize=(15,5))
# Setting the title
plt.title('Percentage of daily revenue by each hotel type', fontsize=20)
plt.axis('equal')
# Show the chart
plt.show()
From the above insight it is clear that City hotel has more share in revenue generation over Resort Hotel.
Stake holderscould improve the service of Resort hotel so that people stay more in resort hotel and increase the revenue.
Chart - 17
# Grouping by hotel and taking the mean of days in waiting list
waiting_time_df = hb_df1.groupby('hotel')['days_in_waiting_list'].mean().reset_index()
# Waiting_time_df
# Setting the plot size
plt.figure(figsize=(8,4))
# Plotting the barchart
sns.barplot(x=waiting_time_df['hotel'],y=waiting_time_df['days_in_waiting_list'])
# Setting the labels
plt.xlabel('Hotel type',fontsize=12)
plt.ylabel('waiting time',fontsize=12)
plt.title("Waiting time for each hotel type",fontsize=20)
# Show chart
plt.show()
Above chart shows that City hotel has more waiting period. This could be because people stay more in City hotel as we saw in previous insight.
Stakeholders should increase rooms in City hotel or convert some of rooms of Resort hotel into City Hotel to decrease the waiting time.
Chart - 18
# Grouping hotel types on repeated guests
rep_guest = hb_df1[hb_df1['is_repeated_guest']==1].groupby('hotel').size().reset_index()
# Renaming the column
rep_guest = rep_guest.rename(columns={0:'number_of_repated_guests'})
# Setting the chart size
plt.figure(figsize=(8,4))
# Plotting the values in a bar chart
sns.barplot(x=rep_guest['hotel'],y=rep_guest['number_of_repated_guests'])
# Setting the labels and title
plt.xlabel('Hotel type', fontsize=12)
plt.ylabel('count of repeated guests', fontsize=12)
plt.title('Most repeated guests for each hotel', fontsize=20)
# Show Chart
plt.show()
We can see that Resort Hotel has slightly more repeated guests over City Hotel this could be because of less waiting time in Resort Hotel and better service there because of less rush.
Chart - 19
# Grouping arrival_month and hotel on mean of adr
bookings_months=hb_df1.groupby(['arrival_date_month','hotel'])['adr'].mean().reset_index()
# Creating a month list to order the months in ascending
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# Creating a dataset of months, hotel and their adr
bookings_months['arrival_date_month']=pd.Categorical(bookings_months['arrival_date_month'],categories=months,ordered=True)
# Sorting the months
bookings_months=bookings_months.sort_values('arrival_date_month')
bookings_months
| arrival_date_month | hotel | adr | |
|---|---|---|---|
| 8 | January | City Hotel | 85.269875 |
| 9 | January | Resort Hotel | 49.181693 |
| 6 | February | City Hotel | 89.266427 |
| 7 | February | Resort Hotel | 54.102809 |
| 15 | March | Resort Hotel | 57.590889 |
| 14 | March | City Hotel | 95.193911 |
| 0 | April | City Hotel | 117.314134 |
| 1 | April | Resort Hotel | 79.283805 |
| 17 | May | Resort Hotel | 80.551101 |
| 16 | May | City Hotel | 128.055724 |
| 13 | June | Resort Hotel | 112.380859 |
| 12 | June | City Hotel | 123.996416 |
| 11 | July | Resort Hotel | 156.166914 |
| 10 | July | City Hotel | 120.318314 |
| 3 | August | Resort Hotel | 187.566659 |
| 2 | August | City Hotel | 125.148662 |
| 22 | September | City Hotel | 118.764693 |
| 23 | September | Resort Hotel | 100.892331 |
| 20 | October | City Hotel | 107.585401 |
| 21 | October | Resort Hotel | 63.723065 |
| 18 | November | City Hotel | 89.882912 |
| 19 | November | Resort Hotel | 48.871043 |
| 5 | December | Resort Hotel | 65.488671 |
| 4 | December | City Hotel | 93.204767 |
# Setting the chart size
plt.figure(figsize=(15,5))
# Plotting the values in a line chart
sns.lineplot(x=bookings_months['arrival_date_month'],y=bookings_months['adr'],hue=bookings_months['hotel'])
# Setting the labels and title
plt.title('ADR across each month', fontsize=20)
plt.xlabel('Month Name', fontsize=12)
plt.ylabel('ADR', fontsize=12)
# Show chart
plt.show()
City Hotel : It is clear that City Hotel generates more revenue in May months in comparison to other months.
Resort Hotel : Resort Hotel generates more revenue in between July and August months.
Stakeholders could prepare in advance for these 2 months as these 2 months generate more revenue.
Chart - 20
# Grouping dist_channel and hotels on their adr
dist_channel_adr = hb_df1.groupby(['distribution_channel','hotel'])['adr'].mean().reset_index()
# Setting the figure size
plt.figure(figsize=(15,5))
# Creating a horizontal bar chart
sns.barplot(x='adr', y='distribution_channel', data=dist_channel_adr, hue='hotel')
# Setting the title
plt.title('ADR across each distribution channel', fontsize=20)
# Show chart
plt.show()
GDS has contributed more in generating the ADR. GDS is a worldwide conduit between travel bookers and suppliers, such as hotels and other accommodation providers. It communicates live product, price and availability data to travel agents and online booking engines, and allows for automated transactions.
Direct- means that bookings are directly made with the respective hotels
TA/TO- means that booings are made through travel agents or travel operators.
Undefined- Bookings are undefined. may be customers made their bookings on arrival.
# Correlation Heatmap visualization code
# Setting the chart size
plt.figure(figsize=(15,10))
# Creating heatmap to see correlation of each columns
sns.heatmap(hb_df1.corr(numeric_only=True),annot=True) # Setting the numeric only colun to True to avoid warning
# Setting the title
plt.title('Correlation of the columns', fontsize=20)
# Show heatmap
plt.show()
Correlation heatmaps was used to find potential relationships between variables and to understand the strength of these relationships.
1) lead_time and total_stay is positively corelated. that means if customers stay more then the lead time increases.
2)adults,childrens and babies are corelated to each other. That means more the people more will be adr.
3) is_repeated guest and previous bookings not canceled has strong corelation. That means repeated guests don't cancel their bookings.**
# Pair Plot visualization code
sns.pairplot(hb_df1)
plt.show()
A pairs plot allows us to see both distribution of single variables and relationships between two variables .
We can see the realtionship between all the columns with each other in above chart.
Explain Briefly.
Inorder to achieve the business objective, i would suggest the client to make the price dynamic, introduce offers and packages to attract new customers. To retain the existing customers and ensure their repetition the client must introduce loyalty points program which can be redeemed by the customers in their next bookings. Amenities such as parking spaces, kids corner, free internet connection can be provided to increase the number of bookings.